This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
The data preparation involves the following steps, detailed in the following tabs:
Load packages: installs and load the packages required for the scripts
Load data: imports datasets
Clean data: this section contains ad hoc data cleaning, i.e. which is not used in other reports (otherwise cleaning should be done in a dedicated report); this section is also used to create new variables used in the analyses
This code chunk shows the loading of packages required for the analyses.
# Create vector of names of required packages:
packages_epicurve <- c("rio", # File import
"here", # File locator
"tidyverse", # data manipulation
"ggplot2", # Produce plots and graphs
"aweek", # working with dates
"lubridate", # Manipulate dates
"incidence", # an option for epicurves of linelist data
"stringr", # Search and manipulate character strings
"forcats", # working with factors
"RColorBrewer", # Color palettes from colorbrewer2.org
"DT" # produce tables for this html handbook
) ### close vector of packages
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(packages_epicurve, character.only=TRUE)Two example datasets are used in this section:
If viewing in Google Chrome, you can access these datasets in Microsoft Excel by clicking HERE and HERE. TODO.
The datasets are imported using the import() function from the rio package. See the page on importing data for various ways to import data. Each data set is displayed below as a table for viewing.
For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
Review the two datasets and notice the differences
Linelist dataset
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )Aggregated counts dataset
You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.
Now we can reference the object data_date into the code and have it reference that date.
Optionally, it can be nice to identify all the date variables and store their names in a vector. This can be done by individually naming them, or by searching for them by looking for keywords.
Method 1
# Method 1. Define date variables explicitly in a vector
DateVars <- c("date_of_onset",
"date_of_hospitalisation",
"date_of_outcome"
)
DateVars
## [1] "date_of_onset" "date_of_hospitalisation"
## [3] "date_of_outcome"Method 2
# Method 2: Search for date columns
DateVars <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
DateVars
## [1] "date_of_onset" "date_of_hospitalisation"
## [3] "date_of_outcome"
#Note: other search tool options within vars_select include contains() ends_with(), starts_with(), or one_of()Convert the date variables to class “date”. There are a few options that use different packages. Each is explained in the code chunk.
# Method 1. Manually convert each variable (allows flexibility in format)
linelist$date_of_onset <- as.Date(linelist$date_of_onset, format = "%m/%d/%Y")
linelist$date_of_hospitalisation <- as.Date(linelist$date_of_hospitalisation, format = "%m/%d/%Y")
linelist$date_of_outcome <- as.Date(linelist$date_of_outcome, format = "%m/%d/%Y")
# Method 2. Use clean_dates() on the dataset to clean date variables
#linelist <- linelist %>%
# linelist::clean_dates(error_tolerance = 0.1)
# Note: Use guess_dates() function from the linelist package addresses messy dates (different formats within a variable)Verify that each variable was successfully converted to date class by printing statistics and a quick histogram for each one.
# To verify successful conversion of date variables
# Creates list of column numbers of date variables
varNums <- c()
for (varName in DateVars) {
varNum <- match(varName, names(linelist))
varNums <- c(varNums, varNum)
}
# Produce output for each date variable converted
for (varNum in varNums) {
varName <- names(linelist)[varNum] # get name of variable
class <- class(linelist[, varNum]) # get class
missing <- sum(is.na(linelist[, varNum])) # get number missing values
hist(linelist[, varNum], # histogram
breaks = 50,
main = paste0("Histogram of: ", varName, ", Class: ", class, ", Missing: ", missing),
xlab = varName)
}incidence packageBelow are tabs on using the “incidence” package
This section shows variations on the epicurve using the incidence package
These are simple epicurves using the incidence package. The epicurve is assigned to the object “epicurve”, which is then plotted. Remember that incidence::plot() is different to base::plot()
The interval defines how the observations are grouped. Options are all those in the package aweek, including but are not limited to:
* “Monday week” * “2 Monday weeks” * “Sunday week”
* “MMWRweek” (starts on Sunday)
* “Month”
* “Quarter”
* “Year”
First date and last date can also be specified.
# incidence object is created, with data aggregated at one day intervals
epicurve_daily <- incidence::incidence(linelist$date_of_onset, interval = "day")
# If weekly, you can specific the start day
epicurve_weekly <- incidence::incidence(linelist$date_of_onset, interval = "Monday week")
epicurve_3weekly <- incidence::incidence(linelist$date_of_onset, interval = "3 weeks")
# Monthly
epicurve_monthly <- incidence::incidence(linelist$date_of_onset, interval = "month")
# Plot the incidence object
plot(epicurve_daily)
plot(epicurve_weekly)
plot(epicurve_3weekly)
plot(epicurve_monthly)Behind the scenes, incidence is using ggplot(), so you can add aesthetic themes and other lines using the ggplot syntax.
# Set theme elements using ggplot syntax
epicurve_theme <- ggplot2::theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
legend.title = element_blank(),
panel.grid.major.x = element_line(color = "grey60", linetype = 3),
panel.grid.major.y = element_line(color = "grey60", linetype = 3)
)
# Sets labels using ggplot syntax
epicurve_labels <- labs(x = "Week",
y = "Cases (n)",
title = "H5N7 cases by week of onset",
caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))
# plot the epicurve with aesthetics
nice_plot <- plot(epicurve_weekly, show_cases = TRUE, border = "black", n_breaks = nrow(epicurve_weekly)) +
scale_y_continuous(expand = c(0, 0)) + # set origin for axes
# add labels
epicurve_labels +
# add theme
epicurve_theme
nice_plot
# Modify nice_plot to show only 6 breaks in the x-axis
nice_plot + scale_x_incidence(epicurve_weekly, n_breaks = 6)Now differentiating the cases by gender, using the groups = argument in the incidence command
# Create epiweek object, with counts grouped by gender
epicurve_weekly_gender <- incidence(linelist$date_of_onset,
interval = "week",
groups = linelist$gender,
na_as_group = FALSE) # Prevents missing values from being assigned their own group
# Plot the epicurve
# Note: Remove the boxes around each case as it makes gender colours hard to see! (show_cases = FALSE)
nice_plot <- plot(epicurve_weekly_gender, show_cases = FALSE, border = "black", n_breaks = nrow(epicurve_weekly_gender)) +
# add labels (defined in previous section)
epicurve_labels +
# add theme elements
epicurve_theme
nice_plotTo filter data, This version is filtered to only show data from a specific province.
# filter the dataset and pass it to the incidence() function
Zhejiang_data <- filter(linelist, province == "Zhejiang")
epicurve_Zhejiang <- incidence(Zhejiang_data$date_of_onset,
interval = "week",
groups = Zhejiang_data$gender)
# Re-sets labels, changing title to reflect subset
epicurve_labels <- labs(x = "Week",
y = "Cases",
title = "H5N7 cases by week of onset in Zhejiang",
caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))
# plot as before
plot(epicurve_Zhejiang, show_cases = TRUE, border = "grey") +
# add labels (defined in previous section)
epicurve_labels +
# add theme elements
epicurve_themeggplot2Below are tabs on using “ggplot2” package
# Daily case counts
###################
plot_daily <- ggplot(linelist, aes(x = date_of_onset)) +
# stacked bars, bined by day (1 days)
stat_bin(binwidth = 1, position="stack")
print(plot_daily)
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_of_onset)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position="stack", fill = "brown")
print(plot_weekly)# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_of_onset, factor = TRUE, numeric = TRUE)
# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_of_onset)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
# X-axis 21-day labels
scale_x_date(
# Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_of_onset, na.rm = T), to = max(linelist$date_of_onset, na.rm=T), by = "1 week"),
# axis limits determined by max/min + buffer
limits = c((min(linelist$date_of_onset, na.rm = T) - 8), (max(linelist$date_of_onset, na.rm = T) + 8)),
# displays as date number, then abbreviated month (e.g. 12 Oct)
date_labels = "%d-%b",
# sets origin at (0,0)
expand = c(0,0)) +
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 5),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic")
) +
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "subtitle here",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.")) +
ggtitle("Epidemic curve")
# print
print(plot_weekly)Colored by a category
# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02", #
"#1b9e77",
"#7570b3") #
# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender,
levels = levels(fct_rev(fct_infreq(linelist$gender))))
# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_of_onset),])
# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_of_onset, fill = gender)) +
# stacked bars, width of 7 days
stat_bin(binwidth = 7, position = "stack") +
# Colors and labels of confirmed/probable
scale_fill_manual(values = rev(colors_overall),
labels = str_to_sentence(levels(factor(linelist$gender)))) +
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_of_onset, na.rm = T), to = max(linelist$date_of_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_of_onset, na.rm=T)), (max(linelist$date_of_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
# print
print(plot_defined_cats)# PARAMETERS
#############
# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_of_onset)))
# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_of_onset > (data_date - 90))])
active_zones
## [1] "Fujian" "Jiangxi" "Beijing" "Hebei" "Guangdong"
# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
filter(province %in% active_zones) %>%
group_by(province) %>%
summarise(cases = n())
order_table
## # A tibble: 5 x 2
## province cases
## <chr> <int>
## 1 Beijing 3
## 2 Fujian 5
## 3 Guangdong 1
## 4 Hebei 1
## 5 Jiangxi 6
# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)
# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))
# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
# Value assignments
# Non-active zones
linelist$active_zone == FALSE ~ "Non-active zones",
# All others are assigned their names, capitalized
TRUE ~ stringr::str_to_title(linelist$province)),
# Order of variable levels
levels = c(
# "Non-active zones" is first level
"Non-active zones",
# Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))
table(linelist$graph_zone, useNA = "ifany")
##
## Non-active zones Hebei Guangdong Beijing
## 120 1 1 3
## Fujian Jiangxi
## 5 6
# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1
# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
"#1b9e77", # turquoise green
"#ff7f00", # orange
"#ffff33", # yellow
"#6a3d9a", # purple
"#b15928", # brown
"#1f78b4", # blue
"#e31a1c", # red,
"#fb9a99", # pink
"#b2df8a", # light green
"#cab2d6", # light purple
"#a6cee3", # light blue
"#fdbf6f", # beige
"#33a02c" # green
)
# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))
# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_of_onset, fill = graph_zone)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack") +
# Fill of bars
scale_fill_manual(values = colors_linelist,
labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
# X-axis 21-day labels
scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_of_onset, na.rm = T), to = max(linelist$date_of_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_of_onset, na.rm = T) - 8), (max(linelist$date_of_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date number, then abbreviated month (e.g. 12 Oct)
expand = c(0,0)) + # sets origin at (0,0)
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 5),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"), # title size, font, bold
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.caption = element_text(hjust = 0, face = "italic")
) +
# Legend specifications
theme(legend.title = element_blank(), # No legend title
legend.position = c(0.20, 0.85), # placement of legend
legend.background = element_blank(), # legend background
legend.text = element_text(size=12)) + # legend text size
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "Health zones with cases in the last 42 days specified by color",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.",
"\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
ggtitle("Epidemic curve by active health zones")
# print
plot_overall
#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_of_onset)))
# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
linelist$province == "Shanghai" ~ "Shanghai",
linelist$province == "Jiangsu" ~ "Jiangsu",
linelist$province == "Zhejiang" ~ "Zhejiang",
TRUE ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(
"Other (10)",
"Shanghai",
"Jiangsu",
"Zhejiang"
)
)
# checks
table(linelist$graph_areas, useNA = "ifany")
##
## Other (10) Shanghai Jiangsu Zhejiang
## 29 33 28 46
# Color assignments
colors_needed <- length(unique(linelist$graph_areas, na.rm=T)) - 1 # number of colors needed
# list of colors
colors_aire = c("#a6cee3",
"#1f78b4",
"#b2df8a",
"#33a02c",
"#fb9a99",
"#e31a1c",
"#fdbf6f",
"#ff7f00",
"#cab2d6",
"#6a3d9a",
"#ffff99",
"#b15928"
)
# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))
# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_of_onset, fill = graph_areas)) +
stat_bin(binwidth = 7, position="stack") +
scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_of_onset, na.rm = T) - 8), (max(linelist$date_of_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
plot.caption = element_text(hjust = 0, face = "italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
axis.title = element_text(size = 14, face = "bold"),
legend.title = element_blank(),
legend.justification = c(0,1),
legend.position = c(0.05, 1),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
labs(x="Week of illness onset",
y="Number of cases",
subtitle = "",
caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
ggtitle("Cases of influenza, by province")
plot# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
#
# zone_data$wave <- case_when(
# zone_data$date_onset >= as.Date("2018-03-01") &
# zone_data$date_onset < as.Date("2018-10-25") ~ "Wave 1",
#
# zone_data$date_onset >= as.Date("2018-10-25") &
# zone_data$date_onset < as.Date("2019-02-01") ~ "Wave 2",
#
# zone_data$date_onset >= as.Date("2019-02-01") &
# zone_data$date_onset < as.Date("2019-09-15") ~ "Wave 3",
#
# zone_data$date_onset >= as.Date("2019-09-15") ~ "Wave 4",
#
# TRUE ~ NA_character_
# )
#
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
#
#
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
#
# # list of colors
# colors_aire = c("#a6cee3",
# "#1f78b4",
# "#b2df8a",
# "#33a02c",
# "#fb9a99",
# "#e31a1c",
# "#fdbf6f",
# "#ff7f00",
# "#cab2d6",
# "#6a3d9a",
# "#ffff99",
# "#b15928"
# )
#
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
#
#
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) +
#
# stat_bin(binwidth = 7, position = "stack") +
#
# scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#
# scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
# limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#
# scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#
# theme(text = element_text(family = "Segoe Condensed"),
# axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
# axis.text = element_text(size = 12),
# axis.title = element_text(size = 14, face = "bold"),
# axis.line = element_line(colour = "black"),
#
# plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
# plot.caption = element_text(hjust = 0, face = "italic"),
#
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# panel.background = element_blank(),
#
# legend.title = element_blank(),
# legend.justification = c(0,1),
# legend.position = c(0.75, 0.98),
# legend.background = element_blank(),
# legend.text = element_text(size=12)) +
#
# guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#
# labs(x="Week of illness onset",
# y="Number of cases",
# subtitle = "",
# caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#
# ggtitle("Four waves of EVD in Mabalako health zone")
#
# plot_Mabalako
#
#
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
# select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
# group_by(wave) %>%
# summarise(first_onset = min(date_onset, na.rm = T),
# last_admission = max(cte_date, na.rm = T),
# n = n(),
# confirmed = sum(epicasedef == "confirmed"),
# community_deaths = paste0(sum(community_death == 1),
# " (", round(100*sum(community_death == 1)/confirmed),"%)"),
# reg_contacts = paste0(sum(contact_registered == "yes"),
# " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
# surv_contacts = paste0(sum(contact_surveilled == "yes"),
# " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
# top = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
# round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
# health_areas = paste(toupper(unique(aire_de_sante)), collapse=', ')
# )
#
# kable(table)Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.
This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.
Note: The incidence package does not support aggregate data
As before, we must ensure date variables are correctly classified.
# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$Date, # use the Date variable
week_start = "Monday", # epiweek begins on Monday
floor_day = TRUE, # only display year and week #
factor = TRUE) # expand to include all possible weeksggplot(data = count_data, aes(x = as.Date(epiweek), y = Cases, group = District, fill = District))+
geom_bar(stat = "identity")+
# LABELS for x-axis
scale_x_date(date_breaks = "1 month", # displays by month
date_labels = '%b%d\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel1")+
# Theme specifications (axis, text, etc.)
theme(
# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic"))+
# labels
labs(x = "Week of report",
y = "Number of cases",
subtitle = "Cases aggregated by week and shown by district",
caption = "Data source: XXXXX")+
ggtitle("Epidemic curve of disease X in fictional location")Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.
In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.
This uses the linelist dataset
TODO not complete yet
library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
mutate(onset_epiweek = aweek::date2week(date_of_onset, floor_day = TRUE, factor = TRUE)) %>%
group_by(onset_epiweek) %>%
summarize(num_male = sum(gender == "m"),
num_female = sum(gender == "f"),
pct_male = round(100*(num_male / n())),
med_age = median(as.numeric(age), na.rm=T)
)
# remove pct and melt
linelist_week_melted <- linelist_week %>%
select(-c("pct_male", "med_age")) %>%
melt(id.vars = c("onset_epiweek"))
# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
linelist_week,
by = "onset_epiweek")
second_axis <- ggplot(linelist_week_melted,
aes(x = as.Date(onset_epiweek),
y = value, group = variable,
fill = variable)) +
# bars
geom_bar(stat = "identity")+
# Colors and labels of confirmed/probable
scale_fill_manual(values = c("blue", "red"),
labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
scale_color_manual(values = "black")+
scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_of_onset, na.rm = T), to = max(linelist$date_of_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_of_onset, na.rm=T)), (max(linelist$date_of_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
second_axis
# print
print(plot_defined_cats)This tab should stay with the name “Resources”. Links to other online tutorials or resources.